ABSTRACT

Given the growing importance of large-scale graph analytics, there
is a need to improve the performance of graph analysis frameworks
without compromising on productivity. GraphMat is our solution
to bridge this gap between a user-friendly graph analytics
framework and native, hand-optimized code. GraphMat works by
taking vertex programs and mapping them to high-performance sparse
matrix operations in the backend. We get the productivity benefits
of a vertex programming framework without sacrificing performance.
GraphMat is written in C++, and we have been able to write a
diverse set of graph algorithms in this framework with the same
effort as in other vertex programming frameworks. GraphMat
performs 1.2-7X faster than high-performance frameworks such as
GraphLab, CombBLAS and Galois. It achieves better multicore
scalability (13-15X on 24 cores) than other frameworks and is
within 1.2X of native, hand-optimized code on a variety of graph
algorithms. Since GraphMat performance depends mainly on a few
scalable and well-understood sparse matrix operations, GraphMat
can naturally benefit from the trend of increasing parallelism on
future hardware.


1. INTRODUCTION 

Studying relationships among data expressed in the form of graphs
has become increasingly important. Graph processing has become
an important component of bioinformatics, social network analysis,
traffic engineering, etc. With graphs getting larger and queries
getting more complex, there is a need for graph analysis frameworks
to help users extract the information they need with minimal
programming effort.

There has been an explosion of graph programming frameworks
in recent years. All of them claim to provide good productivity,
performance and scalability. However, a recent study has shown [27]
that the performance of most frameworks is off by an order of
magnitude when compared to native, hand-optimized code. Given that
much of this performance gap remains even when running frameworks
on a single node [27], it is imperative to maximize the efficiency
of graph frameworks on existing hardware (in addition to focusing
on scale-out issues). GraphMat is our solution to bridge this
performance-productivity gap in graph analytics.

The main idea of GraphMat is to take vertex programs and map
them to a generalized sparse matrix vector multiplication
operation. We get the productivity benefits of vertex programming
while enjoying the high performance of a matrix backend. In
addition, this approach is easy to understand and reason about,
and it gives users with knowledge of vertex programming a smooth
transition to a high-performance environment. Although other graph
frameworks based on matrix operations exist (e.g. CombBLAS and
PEGASUS), GraphMat wins out in terms of both productivity and
performance, as GraphMat is faster and does not expose users to the
underlying matrix primitives (unlike CombBLAS and PEGASUS).
We have been able to write multiple graph algorithms in GraphMat
with the same effort as in other vertex programming frameworks.

Our contributions are as follows: 

1. GraphMat is the first vertex programming model to achieve
within 1.2X of native, hand-coded, optimized code on a
variety of different graph algorithms. GraphMat is 5-7X faster
than GraphLab and CombBLAS, and 1.2X faster than Galois,
on a single node.

2. GraphMat achieves good multicore scalability, getting a
13-15X speedup over a single-threaded implementation on 24
cores. In comparison, GraphLab, CombBLAS, and Galois
scale by only 2-12X over their corresponding single-threaded
implementations.

3. GraphMat is productive for both framework users and
developers. Users do not have to learn a new programming
paradigm (most are familiar with vertex programming), whereas
backend developers have fewer primitives to optimize, as the
backend is based on sparse matrix algebra, which is well studied
in High Performance Computing (HPC).

Matrices are fast becoming one of the key data structures for
databases, with systems such as SciDB and other array stores
becoming more popular. Our approach to graph analytics can take
advantage of these developments, letting us deal with graphs as
special cases of sparse matrices. Such systems offer transactional
support, concurrency control, fault tolerance, etc., while still
maintaining a matrix abstraction. We offer a path for array
processing systems to support graph analytics through popular
vertex programming frontends.

Basing graph analytics engines on generalized sparse matrix vector
multiplication (SPMV) has other benefits as well. We can leverage
decades of research on techniques to optimize sparse linear algebra
in the High Performance Computing world. Sparse linear algebra
provides a bridge between Big Data graph analytics and High
Performance Computing. Other efforts like GraphBLAS are also part
of this growing effort to leverage lessons learned from HPC to help
big data.

The rest of the paper is organized as follows. Section 2 provides
motivation for GraphMat and compares it to other graph frameworks.
Section 3 discusses the graph algorithms used in the paper.
Section 4 describes the GraphMat methodology in detail. Section 5
gives details of the results of our experiments with GraphMat, and
the final section concludes the paper.

2. MOTIVATION AND RELATED WORK 

Graph programming frameworks come in a variety of different
programming models. Some common ones are vertex programming
("think like a vertex"), matrix operations ("graphs are sparse
matrices"), task models ("vertex/edge updates can be modeled as
tasks"), declarative programming ("graph operations can be written
as datalog programs"), and domain-specific languages ("graph
processing needs its own language"). Of all these models, vertex
programming has been quite popular due to its ease of use and the
wide variety of frameworks supporting it [27].

While vertex programming is generally productive for writing
graph programs, it lacks a strong mathematical model and is
therefore difficult to analyze for program behavior or to optimize
for better backend performance. Matrix models, on the other hand,
are based on a solid mathematical foundation, i.e. graph traversal
computations are modeled as operations on a semi-ring. CombBLAS
is an extensible distributed-memory parallel graph library offering
a set of linear algebra primitives specifically targeting graph
analytics. While this model is great for reasoning about and
performing optimizations, it is seen as hard to program. As shown
in [27], some graph computations such as triangle counting are hard
to express efficiently as a pure matrix operation, leading to long
runtimes and increased memory consumption.

In the High Performance Computing world, sparse matrices are
widely used in simulations and modeling of physical processes.
Sparse matrix vector multiply (SPMV) is a key kernel used in
operations such as linear solvers and eigensolvers. A variety of
optimizations have been performed to improve SPMV performance on
single and multiple nodes. Existing matrix-based graph analytics
operations achieve nowhere near the same performance as these
optimized routines. Our goal is to achieve "vertex programming
productivity with HPC-like performance for graph analytics".

There have been a large number of frameworks proposed for
graph analytics recently, and these differ both in terms of
programming abstractions as well as underlying implementations.
Recent work [27] has compared different graph frameworks, including
Giraph and GraphLab, which are two popular vertex programming
models; CombBLAS, a matrix programming model; Socialite, a
functional programming model; and Galois, a task-based abstraction.
That paper shows that CombBLAS and Galois generally perform well
compared to other frameworks. Moreover, the ability to map many
diverse graph operations to a small set of matrix operations means
that the backend of CombBLAS is easy to maintain and extend, for
example to multiple nodes (Galois does not yet have a multi-node
version). Hence, in terms of performance, we can conclude that
matrix-based abstractions are clearly a good choice for graph
analytics. Matrices are becoming an important class of objects in
databases. Our technique of looking at graph algorithms as
generalizations of sparse matrix algebra leads to a simple way to
connect graph stores to array databases. We believe the rise of
sparse-array-based databases will also help the use of graph
storage and analytics.

There are other matrix-based frameworks for graph processing, such
as PEGASUS. PEGASUS is based on Map-Reduce and suffers from poor
performance due to I/O bottlenecks compared to in-memory
frameworks. Domain-specific languages such as GreenMarl purport to
improve productivity and performance, but at the cost of having to
learn a new programming language. Some other ways to process graphs
include writing vertex programs as UDFs for use in a column store
and GraphX (a set of graph primitives intended to work with Spark).
The popularity and adoption of vertex-based programming models (for
instance, Facebook uses Giraph) establishes the case for
vertex-based models over other alternatives.

In this work, we try to adopt the best of both worlds, and we
compare ourselves to high-performing vertex programming and matrix
programming models (GraphLab and CombBLAS respectively).
We will focus on comparing GraphMat to GraphLab, CombBLAS
and Galois for the remainder of this paper.

3. ALGORITHMS 

To showcase the performance and productivity of GraphMat,
we picked five different algorithms from a diverse set of
applications, including machine learning, graph traversal and graph
statistics. Our choice covers a wide range of functionality (e.g.
traversal or statistics), data per vertex, amount of communication,
iterative vs. non-iterative behavior, etc. We give a brief summary
of each algorithm below.

I. Page Rank (PR): This is an iterative algorithm used to rank
web pages based on some metric (e.g. popularity). The idea is to
compute the probability that a random walk through the hyperlinks
(edges) would end in a particular page (vertex). The algorithm
iteratively updates the rank of each vertex according to the
following equation:

$$PR^{t+1}(v) = r + (1 - r) \sum_{u \mid (u,v) \in E} \frac{PR^{t}(u)}{\mathrm{degree}(u)} \tag{1}$$

where $PR^{t}(v)$ denotes the page rank of vertex v at iteration t,
E is the set of edges in a directed graph, and r is the probability
of random surfing. The initial ranks are set to 1.0.


II. Breadth First Search (BFS): This is a very popular graph
search algorithm, which is also used as the kernel by the Graph500
benchmark. The algorithm begins at a given vertex (called the
root) and iteratively explores all connected vertices of an
undirected and unweighted graph. The idea is to assign a distance
to each vertex, where the distance represents the minimum number of
edges that need to be traversed to reach the vertex from the root.
Initially, the distance of the root is set to 0 and it is marked
active. The other distances are set to infinity. At iteration t,
each vertex adjacent to an active vertex computes the following:

$$\mathrm{Distance}(v) = \min(\mathrm{Distance}(v),\; t + 1) \tag{2}$$

If the update leads to a change in distance (from infinity to t + 1),
then the vertex becomes active for the next iteration.
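
The iteration in Equation (2) can be sketched as a level-synchronous loop over the set of active vertices. This is an illustrative C++ sketch only: `bfs_distances`, `kInfinity` and the adjacency-list input are names and choices assumed for this example, and the root is assumed to be a valid vertex index.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// "Infinity" marker for vertices that have not been reached yet.
constexpr unsigned kInfinity = std::numeric_limits<unsigned>::max();

// Level-synchronous BFS over an undirected graph given as adjacency lists.
std::vector<unsigned> bfs_distances(
    const std::vector<std::vector<std::size_t>>& adjacency, std::size_t root) {
  std::vector<unsigned> distance(adjacency.size(), kInfinity);
  std::vector<std::size_t> active = {root};
  distance[root] = 0;
  for (unsigned t = 0; !active.empty(); ++t) {
    std::vector<std::size_t> next_active;
    for (std::size_t u : active) {
      for (std::size_t v : adjacency[u]) {
        if (t + 1 < distance[v]) {  // Equation (2): min(Distance(v), t + 1)
          distance[v] = t + 1;
          next_active.push_back(v);  // distance changed, so v becomes active
        }
      }
    }
    active.swap(next_active);
  }
  return distance;
}
```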

III. Collaborative Filtering (CF): This is a machine learning
algorithm used by many recommender systems for estimating a user's
rating for a given item based on an incomplete set of (user, item)
ratings. The underlying assumption is that users' ratings are based
on a set of hidden/latent features and each item can be expressed
as a combination of these features. Ratings depend on how well the
user's and item's features match. Given a matrix G of ratings, the
goal of the collaborative filtering technique is to compute two
factors $P_u$ and $P_v$, each of which is a low-dimensional dense
matrix. This can be accomplished using incomplete matrix
factorization. Mathematically, the problem can be expressed as
Equation (3), where u and v are the indices of the users and items,
respectively, $G_{uv}$ is the rating of user u for item v, and
$p_u$ and $p_v$ are dense vectors of length K corresponding to each
user and item, respectively.

$$\min_{p_u,\, p_v} \sum_{(u,v) \in G} \left(G_{uv} - p_u^T p_v\right)^2 + \lambda \lVert p_u \rVert^2 + \lambda \lVert p_v \rVert^2 \tag{3}$$

Matrix factorization is usually performed iteratively using
Stochastic Gradient Descent (SGD) or Gradient Descent (GD). In each
iteration t, GD performs Equations (4)-(6) for all users and items.
SGD performs the same updates, without the summation in the update
equations, on all ratings in a random order. The main difference
between GD and SGD is that GD updates all the $p_u$ and $p_v$ once
per iteration instead of once per rating as in SGD.

$$e_{uv} = G_{uv} - p_u^T p_v \tag{4}$$

$$p_u^{*} = p_u + \gamma_t \Big[ \sum_{(u,v) \in G} e_{uv}\, p_v - \lambda p_u \Big] \tag{5}$$

$$p_v^{*} = p_v + \gamma_t \Big[ \sum_{(u,v) \in G} e_{uv}\, p_u - \lambda p_v \Big] \tag{6}$$


IV. Triangle Counting (TC): This is a statistics algorithm useful
for understanding social networks, graph analysis and computing
clustering coefficients. The algorithm computes the number of
triangles in a given graph. A triangle exists when a vertex has two
adjacent vertices that are also adjacent to each other. The
technique used to compute the number of triangles is as follows.
Each vertex shares its neighbor list with each of its neighbors.
Each vertex then computes the intersection between its neighbor
list and the neighbor list(s) it receives. For a given directed
graph with no cycles, the size of the intersections gives the
number of triangles in the graph. When the graph is undirected,
each vertex in a triangle contributes to the count, hence the size
of the intersection is exactly 3 times the number of triangles. The
problem can be expressed mathematically as follows, where $E_{uv}$
denotes the presence of an (undirected) edge between vertex u and
vertex v.

$$N_{triangles} = \sum_{u,v,w \in V \,\mid\, u < v < w} E_{uv} \wedge E_{vw} \wedge E_{uw} \tag{7}$$
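
The neighbor-list intersection technique can be sketched in C++ as follows. This is an illustrative sketch, not the GraphMat implementation: `count_triangles` is a hypothetical name, and the input is assumed to be a directed acyclic graph whose edges point from lower to higher vertex ids, stored as sorted out-neighbor lists.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Counts triangles in a directed acyclic graph stored as sorted
// out-neighbor lists (assumption: every edge goes from a lower to a
// higher vertex id, so each triangle is counted exactly once).
std::size_t count_triangles(
    const std::vector<std::vector<std::size_t>>& neighbors) {
  std::size_t triangles = 0;
  std::vector<std::size_t> common;
  for (std::size_t u = 0; u < neighbors.size(); ++u) {
    for (std::size_t v : neighbors[u]) {
      // v receives u's neighbor list along edge (u, v) and intersects it
      // with its own neighbor list.
      common.clear();
      std::set_intersection(neighbors[u].begin(), neighbors[u].end(),
                            neighbors[v].begin(), neighbors[v].end(),
                            std::back_inserter(common));
      triangles += common.size();
    }
  }
  return triangles;
}
```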


V. Single Source Shortest Path (SSSP): This is another graph
algorithm, used to compute the shortest paths from a single source
to all other vertices in a given weighted and directed graph. The
algorithm is used in many applications such as finding driving
directions in maps or computing the min-delay path in
telecommunication networks. Similar to BFS, the algorithm starts
with a given vertex (called the source) and iteratively explores
all the vertices in the graph. The idea is to assign a distance
value to each vertex, which is the minimum total edge weight needed
to reach that vertex from the source. At each iteration t, each
vertex performs the following:

$$\mathrm{Distance}(v) = \min_{u \mid (u,v) \in E} \{ \mathrm{Distance}(u) + w(u,v) \} \tag{8}$$

where w(u,v) represents the weight of the edge (u,v). Initially,
the Distance of each vertex is set to infinity, except for the
source, whose Distance is set to 0. We use a slight variation on
the Bellman-Ford shortest path algorithm in which we only update
the distance of those vertices that are adjacent to those that
changed their distance in the previous iteration.

We now discuss the implementation of GraphMat and its
optimizations in the next section.

4. GRAPHMAT

GraphMat is based on the idea that graph analytics via vertex
programming can be performed through a backend that supports
only sparse matrix operations. GraphMat takes graph algorithms
written as vertex programs and performs generalized sparse matrix
vector multiplication on them (iteratively in many cases). This is
possible because edge traversals from a set of vertices can be
written as sparse matrix-sparse vector multiplication routines on
the graph adjacency matrix (or its transpose). To illustrate this
idea, a simple example of calculating in-degree is shown in
Figure 1. Multiplying the transpose of the graph adjacency matrix
(unweighted graph) with a vector of all ones produces a vector of
vertex in-degrees. To get the out-degrees, one can multiply the
adjacency matrix with a vector of all ones.


Figure 1: Graph (a) Logical representation (b) Adjacency matrix
(c) In-degree calculation as SPMV $G^T x = y$. Vector x is all
ones. The output vector y indicates the number of incoming edges
for each vertex.
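
The in-degree and out-degree calculations of Figure 1 can be sketched with an ordinary (non-generalized) SPMV. The C++ below is an illustrative sketch under stated assumptions: the unweighted adjacency matrix is stored in CSR form, and `CsrGraph`, `transpose_multiply` and `multiply` are hypothetical names for this example.

```cpp
#include <cstddef>
#include <vector>

// Unweighted adjacency matrix G in CSR form: for row u, the entries
// col_index[row_start[u]] .. col_index[row_start[u + 1] - 1] are the
// targets v of the edges u -> v.
struct CsrGraph {
  std::vector<std::size_t> row_start;  // size = number of vertices + 1
  std::vector<std::size_t> col_index;
};

// y = G^T x. With x all ones, y[v] is the in-degree of v.
std::vector<double> transpose_multiply(const CsrGraph& g,
                                       const std::vector<double>& x) {
  if (g.row_start.empty()) return {};
  const std::size_t n = g.row_start.size() - 1;
  std::vector<double> y(n, 0.0);
  for (std::size_t u = 0; u < n; ++u)
    for (std::size_t e = g.row_start[u]; e < g.row_start[u + 1]; ++e)
      y[g.col_index[e]] += x[u];
  return y;
}

// y = G x. With x all ones, y[u] is the out-degree of u.
std::vector<double> multiply(const CsrGraph& g, const std::vector<double>& x) {
  if (g.row_start.empty()) return {};
  const std::size_t n = g.row_start.size() - 1;
  std::vector<double> y(n, 0.0);
  for (std::size_t u = 0; u < n; ++u)
    for (std::size_t e = g.row_start[u]; e < g.row_start[u + 1]; ++e)
      y[u] += x[g.col_index[e]];
  return y;
}
```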


4.1 Mapping Vertex Programs to Generalized SPMV

The high-level scheme for converting vertex programs to sparse
matrix programs is shown in Figure 2. We observe that while vertex
programs can have slightly different semantics, they are all
equivalent in terms of expressibility. Our vertex programming model
is similar to that of Giraph.

Figure 2: Overview of vertex program to sparse matrix vector
multiply conversion.


A typical vertex program has a state associated with each vertex
that is updated iteratively. Each iteration starts with a subset of
vertices that are "active", i.e. whose states were updated in the
last iteration, and which now have to broadcast their current state
(or a function of their current state) to their neighboring
vertices. A vertex receiving such "messages" from its neighbors
processes each message separately and reduces them to a single
value. The reduced value is used to update the current state of the
vertex. Vertices that change state then become active for the next
iteration. The iterative process continues for a fixed number of
iterations or until no vertices change state (user-specified
termination criterion). We follow the bulk-synchronous parallel
model, i.e. each iteration can be considered a superstep.

The user specifies the following for a graph program in GraphMat.
Each vertex has user-defined property data that is initialized
(based on the algorithm used). A set of vertices is marked active.
The user-defined function Send_Message() reads the vertex data and
produces a message object (done for each active vertex).
Process_Message() reads the message object, the edge data along
which the message arrived, and the destination vertex data, and
produces a processed message for that edge. The Reduce() function
is typically a commutative function taking in the processed
messages for a vertex and producing a single reduced value. Apply()
reads the reduced value and modifies its vertex data (done for each
vertex that receives a message). Send_Message() can be called to
scatter along in- and/or out-edges. We found that this model was
sufficient to express a large number of diverse graph algorithms
efficiently. The addition of access to the destination vertex data
in Process_Message() makes algorithms like triangle counting
and collaborative filtering easier to express than in traditional
matrix-based frameworks such as CombBLAS. Section 4.2 discusses
this in more detail.

Figure 3 shows an example of single source shortest path executed
using the user-defined functions used in GraphMat. We calculate the
shortest path to all vertices from source vertex A. At a given
iteration, we generate a sparse vector using the Send_Message()
function on the active vertices. The message is the shortest
distance to that vertex calculated so far. Process_Message() adds
this message to the edge length, while Reduce() performs a min
operation. Process_Message() and Reduce() together form a sparse
matrix sparse vector multiply operation, replacing the traditional
SPMV multiply operation with addition and the SPMV addition with
min, respectively. Source code for the SSSP algorithm in GraphMat
is provided in the appendix.
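
As a hedged illustration of these four user-defined functions (not the SSSP source from the appendix), the C++ sketch below writes them as plain static functions over `double` distances. The struct name `SsspProgram`, the constant `kUnreached` and all signatures are assumptions made for this example rather than GraphMat's actual interface.

```cpp
#include <algorithm>
#include <limits>

// Distance of a vertex that has not been reached yet (assumed encoding).
constexpr double kUnreached = std::numeric_limits<double>::infinity();

// The four user-defined functions for SSSP, written as plain functions.
struct SsspProgram {
  // Message sent by an active vertex: its shortest distance so far.
  static double send_message(double vertex_distance) {
    return vertex_distance;
  }
  // Replaces the SPMV multiply: message plus edge length. The destination
  // vertex data is available here but is not needed for SSSP.
  static double process_message(double message, double edge_value,
                                double /*destination_distance*/) {
    return message + edge_value;
  }
  // Replaces the SPMV addition: keep the smaller candidate distance.
  static double reduce(double result, double operand) {
    return std::min(result, operand);
  }
  // Updates the vertex state with the reduced value.
  static double apply(double result, double vertex_distance) {
    return std::min(result, vertex_distance);
  }
};
```

For a vertex at distance 2 sending along an edge of length 3, `process_message` yields a candidate of 5, `reduce` keeps the smallest candidate arriving at the destination, and `apply` only lowers the stored distance.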


4.2 Generalized SPMV 

As shown in Figures 1 and 3, a generalized sparse matrix vector
multiplication helps implement multiple graph algorithms. These
examples, though simple, illustrate that overloading the multiply
and add operations of an SPMV can produce different graph
algorithms. In this framework, a vertex program with
Process_Message and Reduce functions can be written as a
generalized SPMV. Assuming that the graph adjacency matrix
transpose is stored in a Compressed Sparse Column (CSC) format, a
generalized SPMV is given in Algorithm 1. We can also partition
this matrix into many chunks to improve parallelism and load
balancing.


Algorithm 1 Generalized SPMV

1: function SPMV(Graph G, SparseVector x, PROCESS_MESSAGE, REDUCE)
2:   y ← new SparseVector()
3:   for j in G^T.column_indices do
4:     if j is present in x then
5:       for k in G^T.column_j do
6:         result ← PROCESS_MESSAGE(x_j, G.edge_value(k, j),
                                    G.getVertexProperty(k))
7:         y_k ← REDUCE(y_k, result)
   return y


We implement SPMV by traversing the non-zero columns in $G^T$.
If a particular column j has a corresponding non-zero at position j
in the sparse vector, then the elements in the column are processed
and values accumulated in the output vector y.
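
A minimal C++ sketch of this column traversal is given below. It is an illustration under assumptions rather than GraphMat's implementation: the names `CscMatrix`, `SparseVector` and `generalized_spmv` are hypothetical, messages and vertex properties are plain `double` values, the matrix is square, and the sparse vector is stored as a validity flag per index plus a dense value array.

```cpp
#include <cstddef>
#include <vector>

// G^T in Compressed Sparse Column form: column j holds the out-edges of
// vertex j, and each row index k is the destination vertex of an edge.
struct CscMatrix {
  std::vector<std::size_t> col_start;  // size = number of columns + 1
  std::vector<std::size_t> row_index;
  std::vector<double> edge_value;
};

// Sparse vector stored as a validity flag plus a dense value array.
struct SparseVector {
  std::vector<bool> present;
  std::vector<double> value;
  explicit SparseVector(std::size_t n) : present(n, false), value(n, 0.0) {}
  void set(std::size_t i, double v) {
    present[i] = true;
    value[i] = v;
  }
};

// Generalized SPMV: process_message replaces multiply, reduce replaces add.
template <typename ProcessMessage, typename Reduce>
SparseVector generalized_spmv(const CscMatrix& gt, const SparseVector& x,
                              const std::vector<double>& vertex_property,
                              ProcessMessage process_message, Reduce reduce) {
  SparseVector y(x.present.size());
  for (std::size_t j = 0; j + 1 < gt.col_start.size(); ++j) {
    if (!x.present[j]) continue;  // skip columns with no message
    for (std::size_t e = gt.col_start[j]; e < gt.col_start[j + 1]; ++e) {
      const std::size_t k = gt.row_index[e];
      const double result =
          process_message(x.value[j], gt.edge_value[e], vertex_property[k]);
      if (y.present[k]) {
        y.value[k] = reduce(y.value[k], result);
      } else {
        y.set(k, result);  // first message for k: nothing to reduce with
      }
    }
  }
  return y;
}
```

Passing an addition for `process_message` and a minimum for `reduce` gives one relaxation step of shortest paths; passing multiply and add recovers an ordinary SPMV.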

GraphMat's main advantage over other matrix-based frameworks
is that it is easy for the user to write different graph programs
with a vertex program abstraction. With other matrix-based
frameworks such as CombBLAS and PEGASUS, the user-defined function
to process a message (equivalent to GraphMat's Process_Message)
can only access the message itself and the value of the edge along
which it is received (similar to the example in Figure 3). This is
very restrictive for many algorithms, especially Collaborative
Filtering and Triangle Counting. In GraphMat, the message
processing function can access the property data of the vertex
receiving the message in addition to the message and edge value. We
have found that this makes it very easy to write different graph
algorithms with GraphMat. While one could technically achieve
vertex data access during message processing with CombBLAS, it
involves non-trivial accesses to the internal data structures that
CombBLAS maintains, adding to the coding complexity of pure
matrix-based abstractions. For example, with triangle counting, a
straightforward implementation in CombBLAS uses a matrix-matrix
multiply, which results in long runtimes and high memory
consumption. Triangle Counting in GraphMat works as two vertex
programs. The first creates an adjacency list of the graph (this is
a simple vertex program where each vertex sends out its id, and at
the end stores a list of all its incoming neighbor ids in its local
state). In the second program, each vertex simply sends out this
list to all neighbors, and each vertex intersects each incoming
list with its own list to find triangles (as described in
Section 3, item IV). This approach is more efficient than the
matrix-matrix formulation. Similar issues occur with implementing
Collaborative Filtering in CombBLAS as well.

4.3 Overall framework 

The overall GraphMat framework is presented in Algorithm 2.
The set of active vertices is maintained using a boolean array for
performance reasons. In each iteration, this array is scanned to
find the active vertices and a sparse vector of messages is
generated. Then, a generalized SPMV is performed using this vector.
The resulting output vector is used to update the state of the
vertices. If any vertices change state, they are marked active for
the next iteration. The algorithm continues for a user-specified
maximum number of iterations or until convergence (no vertices
change state).

SEND_MESSAGE    : message := vertex_distance
PROCESS_MESSAGE : result := message + edge_value
REDUCE          : result := min(result, operand)
APPLY           : vertex_distance = min(result, vertex_distance)

(c)

Figure 3: Example: Single source shortest path. (a) Graph with
weighted edges. (b) Transpose of adjacency matrix. (c) Abstract
GraphMat program to find the shortest distance from a source.
(d) We find the shortest distance to every vertex from vertex A.
Each iteration shows the matrix operation being performed
(Process_Message and Reduce). Dashed entries denote edges/messages
that do not exist (not computed). The final vector (after Apply) is
the shortest distance calculated so far. On the right, we show the
operations on the graph itself. Dotted lines show the edges that
were processed in that iteration. Vertices that change state in
that iteration and are hence active in the next iteration are
shaded. The procedure ends when no vertex changes state. Figure
best viewed in color.


Algorithm 2 GraphMat overview. x, y are sparse vectors.

 1: function RUN_GRAPH_PROGRAM(Graph G, GraphProgram P)
 2:   for i = 1 to MaxIterations do
 3:     for v = 1 to Vertices do
 4:       if v is active then
 5:         x_v ← P.SEND_MESSAGE(v, G)
 6:     y ← SPMV(G, x, P.PROCESS_MESSAGE, P.REDUCE)
 7:     Reset active for all vertices
 8:     for j = 1 to y.length do
 9:       v ← y.getVertex(j)
10:       old_vertexproperty ← G.getVertexProperty(v)
11:       G.setVertexProperty(v, y.getValue(j), P.APPLY)
12:       if G.getVertexProperty(v) ≠ old_vertexproperty then
13:         v set to active
14:     if Number of active vertices == 0 then
15:       break


As shown in Algorithm 2, GraphMat follows an iterative process
of SEND_MESSAGE (lines 3-5), SPMV (line 6), and APPLY (lines
8-13). Each such iteration is a superstep.
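
The superstep loop can be sketched in C++ as below, with comments that refer to the line numbers of Algorithm 2. This is a simplified illustration, not GraphMat's code: `Graph`, `run_graph_program`, `ShortestPath` and `kUnreached` are hypothetical names, vertex properties and messages are assumed to be `double`, and the SPMV is inlined over a CSC copy of the transposed adjacency matrix.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Initial distance for unreached vertices in the example program below.
constexpr double kUnreached = std::numeric_limits<double>::infinity();

struct Graph {
  // Transposed adjacency matrix in CSC form: column j lists the out-edges
  // of vertex j; row_index holds the destination of each edge.
  std::vector<std::size_t> col_start;
  std::vector<std::size_t> row_index;
  std::vector<double> edge_value;
  std::vector<double> vertex_property;
  std::vector<bool> active;
};

// Example program: single source shortest path.
struct ShortestPath {
  double send_message(double distance) const { return distance; }
  double process_message(double m, double w, double) const { return m + w; }
  double reduce(double a, double b) const { return a < b ? a : b; }
  double apply(double reduced, double distance) const {
    return reduced < distance ? reduced : distance;
  }
};

// Runs supersteps until no vertex changes state or max_iterations is hit.
// Returns the number of supersteps executed.
template <typename Program>
std::size_t run_graph_program(Graph& g, const Program& p,
                              std::size_t max_iterations) {
  const std::size_t n = g.vertex_property.size();
  std::size_t supersteps = 0;
  while (supersteps < max_iterations) {
    // Lines 3-5: build the sparse message vector x from active vertices.
    std::vector<bool> x_present(n, false), y_present(n, false);
    std::vector<double> x(n, 0.0), y(n, 0.0);
    for (std::size_t v = 0; v < n; ++v) {
      if (!g.active[v]) continue;
      x_present[v] = true;
      x[v] = p.send_message(g.vertex_property[v]);
    }
    // Line 6: generalized SPMV.
    for (std::size_t j = 0; j < n; ++j) {
      if (!x_present[j]) continue;
      for (std::size_t e = g.col_start[j]; e < g.col_start[j + 1]; ++e) {
        const std::size_t k = g.row_index[e];
        const double r =
            p.process_message(x[j], g.edge_value[e], g.vertex_property[k]);
        y[k] = y_present[k] ? p.reduce(y[k], r) : r;
        y_present[k] = true;
      }
    }
    // Line 7: reset the active set.
    g.active.assign(n, false);
    // Lines 8-13: apply reduced values and mark changed vertices active.
    std::size_t num_active = 0;
    for (std::size_t v = 0; v < n; ++v) {
      if (!y_present[v]) continue;
      const double old_property = g.vertex_property[v];
      g.vertex_property[v] = p.apply(y[v], old_property);
      if (g.vertex_property[v] != old_property) {
        g.active[v] = true;
        ++num_active;
      }
    }
    ++supersteps;
    if (num_active == 0) break;  // lines 14-15
  }
  return supersteps;
}
```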

4.4 Data structures 

We describe the sparse matrix and sparse vector data structures 
in this section. 

4.4.1 Sparse Matrix 

We represent the sparse matrix in the Doubly Compressed Sparse
Column (DCSC) format, which can store very large sparse matrices
efficiently. It primarily uses four arrays to store a given matrix,
as briefly explained here: one array to store the column indices
of the columns which have at least one non-zero element, two
arrays storing the row indices (where there are non-zero elements)
corresponding to each of the above column indices and the non-zero
values themselves, and another array to point to where the row
indices corresponding to a given column index begin in the above
array (allowing access to any non-zero element at a given column
index and row index, if it is present). The format also allows an
optional array to index the column indices with non-zero elements,
which we have not used. The DCSC format has been used effectively
in parallel algorithms for problems such as generalized sparse
matrix-matrix multiplication (SpGEMM), and is part of the
Combinatorial BLAS (CombBLAS) library. The matrix is partitioned in
a 1-D fashion (along rows), and each partition is stored as an
independent DCSC structure.
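
The four arrays can be sketched in C++ as follows. This is an illustration of the layout described above, not the CombBLAS or GraphMat data structure: the names `Triple`, `DcscMatrix`, `build_dcsc` and `lookup` are hypothetical, and the builder assumes its input is already sorted by column and then by row.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Triple {
  std::size_t row;
  std::size_t col;
  double value;
};

struct DcscMatrix {
  std::vector<std::size_t> col_index;  // ids of the non-empty columns only
  std::vector<std::size_t> col_ptr;    // where each such column starts in
                                       // row_index (plus one end sentinel)
  std::vector<std::size_t> row_index;  // row of each non-zero
  std::vector<double> value;           // the non-zero values
};

// Builds a DCSC matrix from triples sorted by (col, row).
DcscMatrix build_dcsc(const std::vector<Triple>& sorted_triples) {
  DcscMatrix m;
  for (const Triple& t : sorted_triples) {
    if (m.col_index.empty() || m.col_index.back() != t.col) {
      m.col_index.push_back(t.col);  // first non-zero of a new column
      m.col_ptr.push_back(m.row_index.size());
    }
    m.row_index.push_back(t.row);
    m.value.push_back(t.value);
  }
  m.col_ptr.push_back(m.row_index.size());
  return m;
}

// Looks up the element at (row, col); returns false if it is not stored.
bool lookup(const DcscMatrix& m, std::size_t row, std::size_t col,
            double& out) {
  auto c = std::lower_bound(m.col_index.begin(), m.col_index.end(), col);
  if (c == m.col_index.end() || *c != col) return false;
  const std::size_t slot = static_cast<std::size_t>(c - m.col_index.begin());
  auto first = m.row_index.begin() + m.col_ptr[slot];
  auto last = m.row_index.begin() + m.col_ptr[slot + 1];
  auto r = std::lower_bound(first, last, row);
  if (r == last || *r != row) return false;
  out = m.value[static_cast<std::size_t>(r - m.row_index.begin())];
  return true;
}
```

Because empty columns take no space in `col_index` or `col_ptr`, storage grows with the number of non-zeros rather than with the number of columns.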

4.4.2 Sparse Vector 

Sparse vectors can be implemented in many ways. Two good
ways of storing sparse vectors are as follows: (1) a variable-sized
array of sorted (index, value) tuples, and (2) a bitvector for
storing valid indices together with a constant-sized (number of
vertices) array with values stored only at the valid indices. Of
these, the latter option provides better performance across all
algorithms and graphs and so is the only option considered for the
rest of the paper. In the SPMV routine in Algorithm 1, line 4
becomes faster due to the use of the bitvector. Since the bitvector
can also be shared among multiple threads and can be cached
effectively, it also helps in improving parallel scalability. The
performance gain from this bitvector use is presented in Section 5.
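
Option (2) can be sketched in C++ as a packed bit array next to a dense value array. The class below is a hypothetical illustration (the name `BitvectorSparseVector` and its methods are not GraphMat's API); it stores one bit per vertex in 64-bit words so that the membership test is a single shift and mask.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sparse vector stored as a bitvector of valid indices plus a dense,
// constant-sized value array.
class BitvectorSparseVector {
 public:
  explicit BitvectorSparseVector(std::size_t n)
      : bits_((n + 63) / 64, 0), values_(n, 0.0) {}

  // Constant-time membership test.
  bool contains(std::size_t i) const {
    return ((bits_[i / 64] >> (i % 64)) & 1u) != 0;
  }
  void set(std::size_t i, double v) {
    bits_[i / 64] |= (std::uint64_t{1} << (i % 64));
    values_[i] = v;
  }
  // Only meaningful when contains(i) is true.
  double get(std::size_t i) const { return values_[i]; }

  // Clearing touches only the bit words, not the value array.
  void clear() { std::fill(bits_.begin(), bits_.end(), 0); }

  // Number of valid indices.
  std::size_t count() const {
    std::size_t n = 0;
    for (std::uint64_t w : bits_)
      for (; w != 0; w &= w - 1) ++n;  // clear the lowest set bit
    return n;
  }

 private:
  std::vector<std::uint64_t> bits_;
  std::vector<double> values_;
};
```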


4.5 Optimizations 

Some of the optimizations performed to improve the performance
of GraphMat are described in this section. The most important
optimizations improve the performance of the SPMV routine, as it
accounts for most of the runtime.

1. Cache optimizations such as the use of bitvectors for storing 
sparse vectors improve performance. 

2. Since the generalized SPMV operations (Process_Message 
and Reduce) are user-defined, using the compiler option to 
perform inter-procedural optimizations (-ipo) is essential. 

3. Parallelization of SPMV among multiple cores in the system 
increases processing speed. Each partition of the matrix is 
processed by a different thread. 

4. Load balancing among threads can be improved through better
partitioning of the adjacency matrix. We partition the matrix
into many more partitions than the number of threads and use
dynamic scheduling to distribute the SPMV load among
threads better. Without this load balancing, the number of
graph partitions equals the number of threads.
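
Optimizations 3 and 4 can be sketched with OpenMP as follows. This is an illustration under assumptions, not GraphMat's scheduler: `Partition`, `make_partitions` and `for_each_partition` are hypothetical names, the matrix is split into contiguous row ranges, and the caller chooses how many partitions to create (for example, a small multiple of the thread count). The `work` callable must be safe to run concurrently on different partitions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A contiguous range of matrix rows, [row_begin, row_end).
struct Partition {
  std::size_t row_begin;
  std::size_t row_end;
};

// Splits num_rows rows into at most num_partitions contiguous ranges.
std::vector<Partition> make_partitions(std::size_t num_rows,
                                       std::size_t num_partitions) {
  std::vector<Partition> parts;
  if (num_partitions == 0) return parts;
  const std::size_t chunk = (num_rows + num_partitions - 1) / num_partitions;
  for (std::size_t begin = 0; begin < num_rows; begin += chunk)
    parts.push_back({begin, std::min(begin + chunk, num_rows)});
  return parts;
}

// Runs work(partition) for every partition. With dynamic scheduling, a
// thread that finishes a light partition picks up the next unprocessed
// one, which evens out the load when partitions differ in cost.
template <typename Work>
void for_each_partition(const std::vector<Partition>& parts, Work work) {
  const long count = static_cast<long>(parts.size());
#pragma omp parallel for schedule(dynamic, 1)
  for (long p = 0; p < count; ++p) {
    work(parts[static_cast<std::size_t>(p)]);
  }
}
```

When compiled without OpenMP support, the pragma is ignored and the loop simply runs sequentially.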

We now discuss the experimental setup, datasets used and the 
results of our comparison to other graph frameworks. 

5. RESULTS 

5.1 Experimental setup 

We performed the experiments¹ on an Intel® Xeon®² E5-2697 v2
based system. The system contains two processors, each with
12 cores running at 2.7 GHz (24 cores in total) sharing 30 MB L3
cache and 64 GB of memory. The machine runs Red Hat Enterprise
Linux Server OS release 6.5. We used the Intel® C++ Composer
XE 2013 SP1 Compiler³ and the Intel® MPI library 5.0 to compile
the native and benchmark code. We used GraphLab v2.2, CombBLAS
v1.3 and Galois v2.2.0 for performance comparisons. In order to
utilize multiple threads on the CPU, GraphMat and Galois use OpenMP
only, GraphLab uses both OpenMP and MPI, and CombBLAS uses MPI
only. Since CombBLAS requires the total number of processes to be
a square (due to its 2D partitioning approach), we use 16 MPI
processes to run on the 24-core system (hence 8 cores remain idle).
We found that running CombBLAS with 25 MPI processes using 24 cores
yields worse performance than running with 16 processes. The native
code, GraphMat, Galois, and GraphLab, however, use the entire
system (24 cores).

1 Software and workloads used in performance tests may have been
optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured
using specific computer systems, components, software, operations
and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and
performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product
when combined with other products. For more information go to
http://www.intel.com/performance

2 Intel and Xeon are trademarks of Intel Corporation in the U.S.
and/or other countries.

3 Intel's compilers may or may not optimize to the same degree for
non-Intel microprocessors for optimizations that are not unique to
Intel microprocessors. These optimizations include SSE2, SSE3, and
SSSE3 instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not
specific to Intel micro-architecture are reserved for Intel
microprocessors. Please refer to the applicable product User and
Reference Guides for more information regarding the specific
instruction sets covered by this notice. Notice revision #20110804


Datasets: We used a mix of real-world and synthetic datasets
for our evaluations. Real-world datasets include Facebook
interaction graphs, the Netflix challenge for collaborative
filtering, the USA road network for California and Nevada, and the
Livejournal, Wikipedia, Delaunay and Flickr graphs from the
University of Florida Sparse Matrix collection. We chose these
datasets primarily to match those used in previous work [27] so
that valid performance comparisons can be made. The table below
provides details on the datasets used, as well as the algorithms
run on these graphs.

Since many real-world datasets are small in size, we augmented
them with synthetic datasets obtained from the Graph500 RMAT
data generator [23]. We adjust the RMAT parameters A, B, C, D
depending on the algorithm run (to correspond to previous work).
Specifically, following [?] we use RMAT parameters A = 0.57,
B = C = 0.19 (D is always 1-A-B-C) for generating graphs for
Pagerank, BFS and SSSP, and different parameters A = 0.45,
B = C = 0.15 for Triangle Counting as in [27]. We generate one
additional scale 24 graph for SSSP with parameters A = 0.50,
B = C = 0.10 to match that used in [?]. Finally, for collaborative
filtering, we used the synthetic bipartite graph generator described
in [27] to generate graphs similar in distribution to the real-world
Netflix challenge graph.
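The role of the A, B, C, D parameters can be illustrated with a minimal sketch of the R-MAT recursion: each edge is placed by choosing, once per level, one of the four quadrants of the adjacency matrix with probabilities A, B, C and D = 1-A-B-C. This is an illustration only, not the Graph500 reference generator (which performs further steps that are omitted here); the names `rmat_edge` and `rmat_graph` and the seed handling are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using RmatEdge = std::pair<std::uint64_t, std::uint64_t>;  // (source, destination)

// Hypothetical helper: draw one directed edge of a graph with 2^scale vertices.
// At each level one quadrant is chosen: top-left with probability a,
// top-right with b, bottom-left with c, bottom-right with d = 1 - a - b - c.
RmatEdge rmat_edge(int scale, double a, double b, double c, std::mt19937_64& rng) {
    assert(scale >= 0 && scale < 64);
    assert(a >= 0.0 && b >= 0.0 && c >= 0.0 && a + b + c <= 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::uint64_t src = 0, dst = 0;
    for (int level = 0; level < scale; ++level) {
        const double r = uniform(rng);
        src <<= 1;
        dst <<= 1;
        if (r < a) {
            // top-left quadrant: both bits stay 0
        } else if (r < a + b) {
            dst |= 1;  // top-right quadrant
        } else if (r < a + b + c) {
            src |= 1;  // bottom-left quadrant
        } else {
            src |= 1;  // bottom-right quadrant
            dst |= 1;
        }
    }
    return {src, dst};
}

// Hypothetical helper: generate num_edges edges (duplicates and self-loops are kept).
std::vector<RmatEdge> rmat_graph(int scale, std::uint64_t num_edges,
                                 double a, double b, double c, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::vector<RmatEdge> edges;
    edges.reserve(num_edges);
    for (std::uint64_t i = 0; i < num_edges; ++i) {
        edges.push_back(rmat_edge(scale, a, b, c, rng));
    }
    return edges;
}
```

Under these assumptions, `rmat_graph(23, m, 0.57, 0.19, 0.19, seed)` would use the parameters quoted above for Pagerank, BFS and SSSP; larger values of A concentrate edges on low-numbered vertices and so produce a more skewed degree distribution.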

Both real-world and synthetic graphs occasionally need
pre-processing for specific algorithms. We first remove self-loops
in the graphs. Pagerank and SSSP usually assume all edges in the
graph are directed and work directly with the graphs obtained. For
BFS, we replicate edges (if the original graph is directed) to obtain
a symmetric graph. For Triangle Counting, the input graph is
expected to be directed acyclic; hence we first replicate edges as in
BFS to make the graph symmetric and then discard the edges in
the lower triangle of the adjacency matrix. Finally, for collaborative
filtering, the graphs have to be bipartite; both the Netflix graph
and the synthetic graph generator ensure this. By using the same
input graphs as previous work, we ensure that we can make direct
performance comparisons.
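The pre-processing steps above can be sketched on a plain edge list. This is a minimal sketch, assuming edges are stored as (source, destination) pairs and that rows of the adjacency matrix correspond to sources; the helper names `remove_self_loops`, `symmetrize` and `to_dag` are hypothetical and are not part of GraphMat.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // (source, destination)

// Drop edges whose two endpoints coincide.
std::vector<Edge> remove_self_loops(std::vector<Edge> edges) {
    edges.erase(std::remove_if(edges.begin(), edges.end(),
                               [](const Edge& e) { return e.first == e.second; }),
                edges.end());
    return edges;
}

// Add the reverse of every edge, then sort and drop duplicates (as for BFS).
std::vector<Edge> symmetrize(std::vector<Edge> edges) {
    const std::size_t n = edges.size();
    edges.reserve(2 * n);  // no reallocation while appending below
    for (std::size_t i = 0; i < n; ++i) {
        edges.emplace_back(edges[i].second, edges[i].first);
    }
    std::sort(edges.begin(), edges.end());
    edges.erase(std::unique(edges.begin(), edges.end()), edges.end());
    return edges;
}

// Symmetrize, then keep only the upper triangle (source < destination),
// which yields a directed acyclic graph (as for Triangle Counting).
std::vector<Edge> to_dag(const std::vector<Edge>& edges) {
    assert(std::all_of(edges.begin(), edges.end(),
                       [](const Edge& e) { return e.first >= 0 && e.second >= 0; }));
    std::vector<Edge> upper;
    for (const Edge& e : symmetrize(remove_self_loops(edges))) {
        if (e.first < e.second) upper.push_back(e);
    }
    return upper;
}
```

Because every retained edge points from a lower-numbered to a higher-numbered vertex, no cycle can exist in the result, and each undirected edge survives exactly once.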


Dataset                                | # Vertices                        | # Edges                | Algorithms               | Brief Description
---------------------------------------+-----------------------------------+------------------------+--------------------------+--------------------------------
Synthetic Graph500 [23] RMAT Scale 20  | 1,048,576                         | 16,746,179             | Tri Count                | Described in Section 5.1
Synthetic Graph500 [23] RMAT Scale 23  | 8,388,608                         | 134,215,380            | Pagerank, BFS, SSSP      | Described in Section 5.1
Synthetic Graph500 RMAT Scale 24       | 16,777,216                        | 267,167,794            | SSSP                     | Described in Section 5.1
LiveJournal [14]                       | 4,847,571                         | 68,993,773             | Pagerank, BFS, Tri Count | LiveJournal follower graph
Facebook [31]                          | 2,937,612                         | 41,919,708             | Pagerank, BFS, Tri Count | Facebook user interaction graph
Wikipedia [14]                         | 3,566,908                         | 84,751,827             | Pagerank, BFS, Tri Count | Wikipedia link graph
Netflix                                | 480,189 users, 17,770 movies      | 99,072,112 ratings     | Collaborative Filtering  | Netflix Prize
Synthetic Collaborative Filtering [27] | 63,367,472 users, 1,342,176 items | 16,742,847,256 ratings | Collaborative Filtering  | Described in Section 5.1
Flickr [14]                            | 820,878                           | 9,837,214              | SSSP                     | Flickr crawl
USA road (CAL) [?]                     | 1,890,815                         | 4,657,742              | SSSP                     | DIMACS9

Table 1: Real-world and synthetic datasets

5.2 Performance Results 

We first compare the runtime performance of GraphMat to other
frameworks. We demonstrate the performance improvement of
GraphMat over a common vertex programming framework
(GraphLab [?]), a high performance matrix programming framework
(CombBLAS [?]) and a high performance task based framework
(Galois [24]). We then compare GraphMat performance to that of
native, well-optimized, hand-coded implementations of these
algorithms whose performance is limited only by hardware [27].
Finally, we show the scalability of GraphMat as compared to
GraphLab, CombBLAS and Galois.

5.2.1 GraphMat vs. Other frameworks

As mentioned in Section [?], we selected a diverse set of graph
algorithms, and used different real-world and synthetic datasets
for these algorithms (see column "Algorithms" in Table 1 for
details) that were selected to be comparable to previous work. We
report the time taken to run the graph algorithms after loading the
graph into memory (excluding time taken to read the graph from
disk). Figure 4 shows the performance results of running GraphMat,
GraphLab, CombBLAS and Galois on these algorithms and
datasets. The y-axis on these figures is total runtime, except for
Pagerank and Collaborative Filtering, where each algorithm
iteration takes similar time and hence we report time per iteration.
Since we report runtimes, lower bars indicate better performance.

We note that GraphMat is significantly faster than both GraphLab
and CombBLAS on most algorithms and datasets. GraphMat is
faster than Galois on average. As we can see from Figures 4(a) and
4(b), GraphMat is 4-11X faster than GraphLab on both real-world
and synthetic datasets for Pagerank and BFS (average of 7.5X for
Pagerank and 7.9X for BFS). As has been shown previously [27] on
these datasets, CombBLAS performs better than GraphLab due to
its better optimized backend, but GraphMat is still 2-4X better than
CombBLAS. Compared to Galois, GraphMat is 1.5-4X better on
Pagerank and ties on BFS. CombBLAS performs poorly in Triangle
Counting (Figure 4(c)), where intermediate results are so large
as to overflow memory or come close to memory limits; CombBLAS
fails to complete for real-world datasets and is about 36X
slower than GraphMat on the synthetic graph. GraphLab is much
better optimized for this algorithm due to the use of cuckoo hash
data structures and is only 1.5X slower than GraphMat on average.
Galois is 20% faster than GraphMat for Triangle Counting. On
Collaborative Filtering, Figure 4(d) shows that GraphMat is about
7X faster than GraphLab, 4.7X faster than CombBLAS and 1.5X
faster than Galois. These four algorithms were also studied in [27]
and our performance results for GraphLab, CombBLAS and Galois
closely match the results in that paper.

We consider an additional algorithm in this paper (SSSP) to
increase the diversity of applications covered. For Single Source
Shortest Path (SSSP), GraphMat is about 10X faster than both
GraphLab and CombBLAS (Figure 4(e)). This difference is larger
than the ones seen in other algorithms. This arises in part because,
for some of these datasets, SSSP takes many iterations to finish,
with each iteration doing a relatively small amount of work
(especially for the Flickr and USA-Road graphs). For such
computations, GraphMat, which has a small per-iteration overhead,
performs much better than other frameworks. For the other datasets
that do more work per iteration, GraphMat is still 3.6-6.9X better
than GraphLab and CombBLAS. Galois performs better than
GraphMat on SSSP by 30%.

Table 2 summarizes these results. We see from the table that the
geometric mean of the speedup of GraphMat over GraphLab and
CombBLAS is about 5-7X and the speedup over Galois is about 1.2X
over the range of algorithms and datasets. We defer a discussion of
the reasons for this performance difference to Section 5.3. In the
next section, we describe how this performance compares to that of
hand-optimized code, and then discuss the scalability of GraphMat.
Figure 4: Performance results for different algorithms on real-world and synthetic graphs. The y-axis represents runtime (in log-scale), therefore
lower numbers are better.



         | PR  | BFS | TC   | CF  | SSSP | Overall
---------+-----+-----+------+-----+------+--------
GraphLab | 7.5 | 7.9 | 1.5  | 7.1 | 10.6 | 5.8
CombBLAS | 4.1 | 2.2 | 36.0 | 4.8 | 10.2 | 6.9
Galois   | 2.6 | 1.0 | 0.8  | 1.5 | 0.7  | 1.2


Table 2: Summary of performance improvement of GraphMat 
over GraphLab, CombBLAS and Galois. Higher values mean 
GraphMat is faster. 
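For reference, the geometric mean used to summarize speedups can be computed as in the short sketch below. The function name `geometric_mean` is an assumption for this example; the paper does not describe how its averages were computed beyond calling them geometric means. Applying it to the five per-algorithm values in each row of Table 2 gives results that round to that row's "Overall" entry.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive ratios, accumulated in log space so that
// a long list of large speedups cannot overflow the running product.
double geometric_mean(const std::vector<double>& ratios) {
    assert(!ratios.empty());
    double log_sum = 0.0;
    for (double r : ratios) {
        assert(r > 0.0);
        log_sum += std::log(r);
    }
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```

A geometric mean is the natural choice for ratios: a 36X speedup on one algorithm and a 0.8X slowdown on another are combined multiplicatively, so a single outlier influences the summary less than it would in an arithmetic mean.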

5.2.2 GraphMat vs. Native

We now compare GraphMat performance to that of hand-optimized
native implementations. We took the performance results of native
PageRank, BFS, Triangle Counting, and Collaborative Filtering
implementations from [27], since we used the same datasets and
machines with identical configuration to that work. Table 3 shows the
results of our comparison with the geometric mean over all datasets
for each algorithm. The table shows the slowdown of GraphMat
with respect to native code. We can see that GraphMat is comparable
in performance for Pagerank and BFS. For Collaborative Filtering,
GraphMat is faster than native code in terms of runtime per
iteration. This is because the native performance results from [27]
are for Stochastic Gradient Descent (SGD) as opposed to Gradient
Descent (GD) for GraphMat, and GD is more easily parallelizable
than SGD. This is reflected in our performance results.

Table 3 shows that, on average, GraphMat is only 1.2X slower
than native code. It should be noted that hand-optimized native
code typically requires significant effort to write even for expert
users. Moreover, the effort is not usually very portable across
algorithms, and very specific tuning has to be done for each algorithm
and machine architecture. The efforts described in [27] are indeed
difficult to perform for an end-user of a graph framework. However,
GraphMat abstracts away all these optimizations from the user, who
only sees a vertex program abstraction (the SSSP example in
Figure [?] and the appendix gives an indication of the effort
involved). Hence we are able to get close to the same performance
as native code with much lower programming effort.


Algorithm               | Slowdown compared to native code in [27]
------------------------+-----------------------------------------
PageRank                | 1.15
Breadth First Search    | 1.18
Triangle Counting       | 2.10
Collaborative Filtering | 0.73
Overall (Geomean)       | 1.20

Table 3: Comparison of GraphMat performance to native, optimized code.

We additionally compare with a recently published SSSP GPU
implementation written in CUDA [?]. The Workfront sweep method
described in [?] matches our SSSP implementation. When run
on the same Flickr and Graph500 RMAT Scale 24 graphs, GraphMat
is only 1.4X and 2.1X slower, respectively, than the CUDA
implementation on an Nvidia GTX 680 GPU, even though the GPU has
3X higher compute and memory bandwidth than the system we run on.
This shows that GraphMat utilizes hardware resources more efficiently
than the optimized CUDA implementation.



























5.2.3 Scalability 

As most performance improvements across recent processor
generations have come from increasing core counts, it has become
important to consider scalability when choosing application
frameworks as an end-user. In this context, we now discuss the
scalability of GraphMat and compare it to that of GraphLab, CombBLAS
and Galois. Figure 5 shows the scalability results for two
representative applications, Pagerank and SSSP. We see that no
framework scales perfectly linearly with cores, but this is expected
since there are shared resources like memory bandwidth that limit the
scaling of graph workloads. However, among the frameworks, we can see
that GraphMat scales about 13-15X on 24 cores, while GraphLab,
CombBLAS and Galois only scale about 8X, 2-6X and 6-12X
respectively. The trends for other applications are similar. As a
result, on future platforms with increasing core counts, we expect
GraphMat to continue to outperform GraphLab, CombBLAS and Galois.
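One way to read these scaling figures is as parallel efficiency, i.e. the achieved speedup divided by the number of cores. The helper below is a small illustrative sketch (the name `parallel_efficiency` is an assumption for this example); it only restates the numbers quoted above in a different unit.

```cpp
#include <cassert>

// Parallel efficiency: achieved speedup divided by the number of cores used.
// A value of 1.0 would correspond to perfectly linear scaling.
double parallel_efficiency(double speedup, int cores) {
    assert(cores > 0);
    assert(speedup > 0.0);
    return speedup / static_cast<double>(cores);
}
```

With the 24-core figures above, GraphMat's 13-15X corresponds to roughly 54-63% efficiency, GraphLab's 8X to about 33%, CombBLAS's 2-6X to about 8-25%, and Galois's 6-12X to 25-50%.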


[Figure 5 panels: (a) PageRank, (b) Single Source Shortest Path. x-axis: Cores. Series: GraphMat, GraphLab, CombBLAS, Galois.]

Figure 5: Scalability of the frameworks using Pagerank and single
source shortest path algorithms on the Facebook and Flickr datasets
respectively.

These scaling results do not completely account for the better
performance of GraphMat over GraphLab and CombBLAS; GraphMat
performs better than most frameworks even when all frameworks
are run on a single thread (for example, single-threaded GraphMat
is 2-2.5X faster than CombBLAS and about 8-12X faster than
GraphLab). Hence even the baseline for Figure 5 is generally better
for GraphMat compared to others. For SSSP alone, Galois has better
single thread performance (1.5X) compared to GraphMat as it
runs fewer instructions (Section 5.3). However, Galois scales worse
than GraphMat for this algorithm. In the next section, we discuss
the reasons why GraphMat outperforms other frameworks.

5.3 Discussion of performance 

To understand the performance of the frameworks, we performed
a detailed analysis with hardware performance counters. Performance
counters are collected for the duration of the application run
reported in Figure 4. This approach of collecting cumulative counters
results in better fidelity than sampling based measurements,
particularly when the runtimes are small. Figure 6 shows the
collected data for four of the applications on all the frameworks.
Since graph analytics operations are mostly memory bandwidth and
latency constrained, we focus on counters measuring memory
performance. For space reasons, we present only the following key
metrics that summarize our analysis:

1. Instructions: Total number of instructions executed during the
   test run.

2. Stall cycles: Total number of cycles the CPU core stalled for any
   reason. Memory related reasons accounted for most of the
   stalls in our tests.

3. Read Bandwidth: A measure of the test’s memory performance.
   Write bandwidth is not shown since our tests are mostly
   read-intensive.

4. Instructions per cycle (IPC): A measure of the test’s overall CPU
   efficiency.

Of these metrics, well-performing code executes fewer instructions, 
encounters fewer stalls and achieves high read bandwidth and high 
IPC. 
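The derived metrics in this list follow directly from cumulative counter totals. The sketch below shows the arithmetic under stated assumptions: the `CounterTotals` struct and its field names are hypothetical and do not correspond to any particular profiling tool, and the stall fraction is an extra convenience value that the paper does not report.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for cumulative counter totals over one application run.
struct CounterTotals {
    std::uint64_t instructions;  // instructions executed
    std::uint64_t cycles;        // core cycles elapsed
    std::uint64_t stall_cycles;  // cycles in which the core was stalled
    std::uint64_t bytes_read;    // bytes read from memory
    double seconds;              // wall-clock duration of the run
};

// Instructions per cycle: higher means the core is doing more useful work per cycle.
double ipc(const CounterTotals& c) {
    assert(c.cycles > 0);
    return static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}

// Read bandwidth in GB/s (10^9 bytes per second).
double read_bandwidth_gbps(const CounterTotals& c) {
    assert(c.seconds > 0.0);
    return static_cast<double>(c.bytes_read) / c.seconds / 1e9;
}

// Fraction of all cycles spent stalled.
double stall_fraction(const CounterTotals& c) {
    assert(c.cycles > 0);
    return static_cast<double>(c.stall_cycles) / static_cast<double>(c.cycles);
}

// Normalize one framework's metric to a baseline framework's value.
double normalized(double value, double baseline) {
    assert(baseline != 0.0);
    return value / baseline;
}
```

Under this convention, a normalized instruction count of 4.0 means a framework executed four times as many instructions as the baseline on the same run.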

In general, an increase in instruction count and lower IPC
indicate overheads in code such as lack of vectorization (SSE, AVX),
redundant copying of data and wasted work. Increased stall cycles
and reduced memory bandwidth indicate memory inefficiencies,
which can be remedied through techniques like software prefetching
and removing indirect accesses. We find that GraphMat is overall
at the top (or second best) for most of these indicators.

From Figure 6, it is clear that compared to GraphMat, GraphLab
and CombBLAS execute significantly more instructions and have
more stall cycles. This explains the speedup of GraphMat over both
GraphLab and CombBLAS. Even when those frameworks achieve
better memory bandwidth than GraphMat (e.g. Collaborative
Filtering), the benefits are still offset by the increase in instruction
count and stall cycles, implying many unnecessary memory loads
and wasted work leading to an overall slowdown. Galois performs
worse than GraphMat for PageRank due to increased instruction
and stall cycle counts as well. However, Galois performs better than
GraphMat on Triangle Counting due to better IPC. For Collaborative
Filtering, GraphMat has a better IPC and performs better than
Galois. For SSSP, Galois uses asynchronous execution (updated
vertex state can be read immediately, before the end of the iteration)
and hence executes fewer instructions, leading to a 1.35X speedup
over GraphMat. With GraphMat, the updated vertex state can be
read only in the next iteration (bulk synchronous).
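The difference between the two visibility rules can be seen in a toy Bellman-Ford style relaxation loop. The sketch below is illustrative only and does not reproduce either GraphMat's backend or Galois's worklist scheduler; `WeightedEdge`, `kInf` and `sssp_sweeps` are hypothetical names, and whole-graph sweeps are used instead of active-vertex sets to keep the code short.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct WeightedEdge { int src; int dst; float weight; };

constexpr float kInf = std::numeric_limits<float>::max();

// Relax all edges repeatedly until no distance changes; returns the number of sweeps.
// bulk_synchronous == true : relaxations read distances from the previous sweep only.
// bulk_synchronous == false: relaxations read the latest distances immediately.
int sssp_sweeps(const std::vector<WeightedEdge>& edges, std::vector<float>& dist,
                bool bulk_synchronous) {
    int sweeps = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        std::vector<float> snapshot;
        if (bulk_synchronous) snapshot = dist;  // state frozen at the start of the sweep
        for (const WeightedEdge& e : edges) {
            assert(e.src >= 0 && static_cast<std::size_t>(e.src) < dist.size());
            assert(e.dst >= 0 && static_cast<std::size_t>(e.dst) < dist.size());
            const float src_dist = bulk_synchronous ? snapshot[e.src] : dist[e.src];
            if (src_dist == kInf) continue;  // source not reached yet
            const float candidate = src_dist + e.weight;
            if (candidate < dist[e.dst]) {
                dist[e.dst] = candidate;
                changed = true;
            }
        }
        ++sweeps;
    }
    return sweeps;
}
```

On a five-vertex chain 0→1→2→3→4 with unit weights, listed in that order and started from vertex 0, the bulk-synchronous variant needs five sweeps (four to propagate, one to detect convergence) while the asynchronous variant needs two, and both reach the same distances. The edge order here is deliberately favorable to the asynchronous variant; in general its advantage depends on the order in which updates are processed.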

We now discuss at a higher level the main reasons why GraphMat
performs much better than the other frameworks. With respect to
GraphLab, GraphMat supports a similar frontend but maps the vertex
programs to generalized sparse matrix operations as described
in Section [?]. This allows capturing of the global structure of the
matrix and allows for various optimizations including better load
balancing, use of efficient data structures and the use of cache
optimizations such as use of global bitvectors. Moreover, similar
operations have been well optimized by the HPC community and we
leverage some of their work [33]. All these reasons result in more
optimized code than a vertex program backend like GraphLab can
achieve.

Figure 6: Hardware performance counter data for different algorithms averaged over all graphs and normalized to GraphMat. The y-axis is in
log-scale. Lower numbers are better for instructions and stall cycles. Higher numbers are better for Read bandwidth and IPC.

On the other hand, CombBLAS uses a similar matrix backend
as GraphMat, but GraphMat still performs about 7X better on
average. There are two primary causes for this. The first is a
programming abstraction reason: GraphMat allows for vertex state to
be accessed while processing an incoming message (as described
in Section 4), while CombBLAS disallows this. There are two
algorithms, namely Triangle Counting and Collaborative Filtering,
where this ability is very useful both to reduce code complexity and
to reduce runtime, as discussed previously in Section 4.

The second reason for GraphMat to perform better than CombBLAS
is a better backend implementation. For Pagerank, BFS, and
SSSP, the basic operations performed in the backend are similar in
GraphMat and CombBLAS. However, we have heavily optimized
our generalized SPMV backend as described in Section 4.5.
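To make the idea of a generalized SPMV concrete, the following is a minimal sequential sketch in which the usual multiply and add are replaced by user-supplied "process message" and "reduce" functions, and the destination vertex's property is passed to the process step. It is not GraphMat's backend: the `OutEdge` and `SparseVector` types, the adjacency-list layout and the `generalized_spmv` signature are all assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct OutEdge { int dst; float weight; };

// Hypothetical sparse vector: a dense value array plus one validity flag per slot.
struct SparseVector {
    std::vector<float> value;
    std::vector<bool> valid;
    explicit SparseVector(std::size_t n) : value(n, 0.0f), valid(n, false) {}
};

// For every active source vertex, combine its message with each out-edge via
// process(message, edge_weight, destination_property), then merge the results
// arriving at the same destination via reduce(a, b).
template <typename Process, typename Reduce>
SparseVector generalized_spmv(const std::vector<std::vector<OutEdge>>& out_edges,
                              const SparseVector& messages,
                              const std::vector<float>& vertex_property,
                              Process process, Reduce reduce) {
    const std::size_t n = out_edges.size();
    assert(messages.value.size() == n && vertex_property.size() == n);
    SparseVector result(n);
    for (std::size_t src = 0; src < n; ++src) {
        if (!messages.valid[src]) continue;  // only active vertices send messages
        for (const OutEdge& e : out_edges[src]) {
            assert(e.dst >= 0 && static_cast<std::size_t>(e.dst) < n);
            const float processed =
                process(messages.value[src], e.weight, vertex_property[e.dst]);
            if (result.valid[e.dst]) {
                result.value[e.dst] = reduce(result.value[e.dst], processed);
            } else {
                result.value[e.dst] = processed;
                result.valid[e.dst] = true;
            }
        }
    }
    return result;
}
```

With `process` returning `message + edge_weight` and `reduce` returning `std::min(a, b)`, one call performs a single SSSP relaxation step, matching the roles of the message-processing and reduction callbacks in the appendix; other algorithms would swap in different functions while the traversal stays the same.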

On average, GraphMat’s performance differs from that of Galois by
1.2X. Galois is a sophisticated worklist management system with
support for different problem-specific schedulers [24], whereas
GraphMat is built on top of sparse matrix operations. There is not a
huge performance difference between the two frameworks on a single
node. Extending an efficient task-queue based framework like Galois
to other systems (co-processors, GPU, distributed clusters etc.),
however, is a difficult task. In contrast, sparse matrix problems are
routinely solved on very large and diverse systems in the High
Performance Computing world. We also note that GraphMat scales
better than Galois with increasing core counts. Hence, we believe
GraphMat offers an easier and more efficient path to scaling and
extending graph analytics to diverse platforms.

We next describe the performance impact of the optimizations
described in Section 4.5.

5.4 Effect of optimizations 

One of the key advantages of using a matrix backend is that there
are only a few operations that dictate performance and need to be
optimized. In the algorithms described here, most (over 80%) of
the time is spent in the Generalized SPMV operation as described
in Algorithm [?]. The key optimizations performed to optimize this
operation are described in Section 4.5. Figure 7 shows the
performance impact of these four optimizations for Pagerank (running
on the Facebook graph) and SSSP (running on Flickr). The first
bar shows the baseline naive single threaded code normalized to
1. Adding bitvectors to store the sparse vectors to improve cache
utilization itself results in a small performance gain. However, it
enables better parallel scalability. Using the compiler option
-ipo to perform inter-procedural optimization results in a significant
gain of about 1.5X for SSSP and 1.9X for Pagerank. This third bar
represents the best scalar code.
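As an illustration of what storing a sparse vector with a bitvector can look like, the sketch below keeps a dense value array alongside one validity bit per entry, packed 64 to a machine word. This is an assumed layout for illustration, not GraphMat's actual data structure, and the class name `BitvectorSparseVector` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sparse vector: dense values plus one validity bit per entry.
class BitvectorSparseVector {
public:
    explicit BitvectorSparseVector(std::size_t n)
        : values_(n, 0.0f), bits_((n + 63) / 64, 0) {}

    // Store a value and mark the entry as present.
    void set(std::size_t i, float v) {
        assert(i < values_.size());
        values_[i] = v;
        bits_[i / 64] |= (std::uint64_t{1} << (i % 64));
    }

    // True if entry i has been set.
    bool contains(std::size_t i) const {
        assert(i < values_.size());
        return ((bits_[i / 64] >> (i % 64)) & 1u) != 0;
    }

    float get(std::size_t i) const {
        assert(contains(i));
        return values_[i];
    }

    // Number of present entries; an all-zero word (64 absent entries)
    // is skipped with a single comparison.
    std::size_t count() const {
        std::size_t total = 0;
        for (std::uint64_t word : bits_) {
            while (word != 0) {
                word &= word - 1;  // clear the lowest set bit
                ++total;
            }
        }
        return total;
    }

private:
    std::vector<float> values_;
    std::vector<std::uint64_t> bits_;
};
```

The presence information for 64 entries fits in eight bytes, so membership checks touch far less memory than scanning a byte or word flag per entry, which is one plausible reading of the cache-utilization benefit described above.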

The fourth and fifth bars deal with parallel scalability. The
addition of bitvectors allows for a parallel scalability of about 11.7X
and 4.7X on Pagerank and SSSP respectively (without bitvectors,
these numbers were as low as 3.9X and 3.4X on Pagerank and
SSSP respectively). These scalability results are multiplicative with
the gains from -ipo and bitvectors, resulting in the fourth bar.
Finally, load balancing optimizations result in a further gain of 1.2X
for Pagerank and 2.8X for SSSP. This results in overall gains of
27.3X and 19.9X from naive scalar code for Pagerank and SSSP
respectively. Similar results were obtained for other algorithms and
datasets as well.
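Since the reported gains are described as multiplicative, a quick consistency check is to divide the overall gain by the product of the individually quoted factors; what remains is the factor that was not given a number, here the scalar gain from bitvectors. The sketch below does this arithmetic, assuming the factors simply multiply and ignoring rounding in the reported values; the function name `implied_remaining_factor` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Overall gain divided by the product of the individually reported factors.
double implied_remaining_factor(double overall_gain,
                                const std::vector<double>& reported_factors) {
    double product = 1.0;
    for (double f : reported_factors) {
        assert(f > 0.0);
        product *= f;
    }
    return overall_gain / product;
}
```

For Pagerank, 27.3 / (1.9 × 11.7 × 1.2) is about 1.02, and for SSSP, 19.9 / (1.5 × 4.7 × 2.8) is about 1.01, which is consistent with the statement that bitvectors by themselves yield only a small gain in scalar code.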
Figure 7: Effect of optimizations performed on naive implementations
of Pagerank and single source shortest path algorithms.


From the GraphMat user’s perspective, there is no need to
understand or optimize the backend in any form. In fact, there is very
little performance tuning left to the user (the only tunable parameters
are the number of threads and the number of desired matrix partitions).

To summarize, SPMV operations are heavily optimized and
result in better performance for all algorithms in GraphMat. If one
considers vertex programming to be productive, there is no loss of
productivity in using GraphMat. Compared to matrix programming
models, there are huge productivity gains to be had. Our backend
optimizations and frontend abstraction choices (such as the ability
to read vertex data while processing messages) make GraphMat
productive without sacrificing any performance.

6. CONCLUSION AND FUTURE WORK 

We have demonstrated GraphMat, a graph analytics framework
that utilizes a vertex programming frontend and an optimized matrix
backend in order to bridge the productivity-performance gap.
We show performance improvements of 1.2-7X when compared to
other optimized frameworks such as GraphLab, CombBLAS and
Galois, in addition to scaling better on multiple cores. GraphMat
is about 1.2X off the performance of native, hand-optimized code
on average. For users of graph frameworks accustomed to vertex
programming, this provides an easy option for improving
performance. Given that GraphMat is based on SPMV, we expect it to
scale well to multiple nodes. Furthermore, improvements in single
node efficiency translate to fewer nodes used (for a given problem
size) and will lead to better cluster utilization. Our optimizations
to the matrix backend can be adopted by other frameworks such as
CombBLAS as well, leading to better performance no matter the
choice of programming model. Our work also provides a path for
array processing systems to support graph analytics through popular
vertex programming frontends.
APPENDIX

We provide the GraphMat source code for Single Source Shortest
Path (SSSP) for reference.

A. SSSP SOURCE CODE 

#include <algorithm>  // std::min
#include <cfloat>     // FLT_MAX
//The GraphMat declarations used below (GraphProgram, Graph, OUT_EDGES,
//graph_program_init, run_graph_program, graph_program_clear) must also be
//visible at this point.

//Distance is stored as a single precision floating point number
typedef float distance_type;
distance_type MAX_DIST = FLT_MAX;

class SSSP : public GraphProgram<distance_type, distance_type, distance_type>
{
  //Graph Programs are templatized with 3 types -
  //1. Message type
  //2. Processed message type (same as reduced value type)
  //3. Vertex property type
  //For SSSP, all are distance_type

  public:
    SSSP() {
      order = OUT_EDGES;
      //Performs path traversals only via out-edges
    }

    //Send message - Read source vertex property and generate message
    bool send_message(const distance_type& vertexprop,
                      distance_type& message) const {
      message = vertexprop;
      return true;
    }

    //Process message - Read message, edge weight, destination vertex
    //property and update result (processed message)
    void process_message(const distance_type& message,
                         const int edge_weight,
                         const distance_type& vertexprop,
                         distance_type& result) const {
      result = message + edge_weight;
    }

    //Reduction function - reduce two reduced value types a & b into a
    void reduce_function(distance_type& a,
                         const distance_type& b) const {
      a = std::min(a, b);
    }

    //Apply - Read reduced value and update vertex property
    void apply(const distance_type& reduced_value,
               distance_type& vertexprop) {
      vertexprop = std::min(vertexprop, reduced_value);
    }
};

void run_sssp(char* filename, int nthreads) {
  //Declare graph with distance_type as vertex property
  Graph<distance_type> G;

  //Read file and create 8*nthreads partitions of matrix
  G.ReadMTX(filename, nthreads * 8);

  //Set all distances to infinity
  G.setAllVertexproperty(MAX_DIST);

  //Create instance of SSSP program
  SSSP sssp_instance;

  //Allocate workspace for GraphMat
  auto workspace = graph_program_init(sssp_instance, G);

  //Source vertex is 6. Set distance of source to 0 and mark it active.
  int v = 6;
  G.vertexproperty[v] = 0;
  G.active[v] = true;

  //Run GraphMat program sssp_instance on graph G until convergence
  run_graph_program(&sssp_instance, G, -1, &workspace);
  //G.vertexproperty is an array that holds the correct shortest
  //distances from vertex 6 to all other vertices

  //Clear workspace
  graph_program_clear(workspace);
}